Instructions¶

  1. Labeling & Peer Grading: Your homework will be peer graded. To stay anonymous, avoid using your name and label your file with the last four digits of your student ID (e.g., HW#_Solutions_3938).

  2. Submission: Submit both your IPython notebook (.ipynb) and an HTML file of the notebook to Canvas under Assignments → HW # → Submit Assignment. After submitting, download and check the files to make sure that you've uploaded the correct versions. Both files are required for your HW to be graded.

  3. No PDF file is required; write all the details in your .ipynb file.
  4. AI Use Policy: Solve each problem independently by yourself. Use AI tools like ChatGPT or Google Gemini for brainstorming and learning only—copying AI-generated content is prohibited. Violations will lead to penalties, up to failing the course.

  5. Problem Structure: Break down each problem ( already done in most problems) into three interconnected parts and implement each in separate code cells. Ensure that each part logically builds on the previous one. Include comments in your code to explain its purpose, followed by a Markdown cell analyzing what was achieved. After completing all parts, add a final Markdown cell reflecting on your overall approach, discussing any challenges faced, and explaining how you utilized AI tools in your process.

  6. Deadlines & Academic Integrity: This homework is due on 11/05/2024 at midnight. Sharing this assignment or its answers with any person or website constitutes academic dishonesty at ISU. Do not share or post course materials without the express written consent of the copyright holder and instructor. The class will follow Iowa State University’s policy on academic dishonesty. Anyone suspected of academic dishonesty will be reported to the Dean of Students Office.

Each problem is worth 25 points. Total $\bf 25\times 4 = 100$.¶

Select 10 Stocks for the Portfolio Based on Positive Sentiment Trends¶

Problem 1.¶

Upload the sn_ids.csv and do the following. Make sure to explain all the details for each part.

  • Apply the PageRank algorithm to compute the rank for each ID in the network. Sort the results in descending order, round the PageRank values to 3 decimals, and create a DataFrame with two columns: "ids" and "PageRank."
  • From the PageRank DataFrame, select the top 5 IDs and use them to filter the original sn_ids.csv based on 'id_1' and 'id_2'. Plot the network graph of the filtered data with figsize(50, 50) and use different colors for "id_1" and "id_2."
  • Repeat part 1 using the HITS algorithm to compute authority and hub scores. Then, select the top 5 IDs based on authority values and repeat part 2 to filter and plot the network graph.
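Before running both algorithms on the full edge list below, their behavior can be sanity-checked on a tiny hypothetical graph (the four-node example here is illustrative only, not part of the assignment data):

```python
import networkx as nx

# Toy directed graph: nodes 0, 1, and 2 all link to node 3, and node 3 links back to 0.
# Node 3 should therefore top both the PageRank and the HITS authority rankings.
G_toy = nx.DiGraph([(0, 3), (1, 3), (2, 3), (3, 0)])

pagerank = nx.pagerank(G_toy, alpha=0.85)            # 0.85 is the conventional damping factor
hubs, authorities = nx.hits(G_toy, normalized=True)  # HITS returns (hub, authority) dicts

print(max(pagerank, key=pagerank.get))        # node 3 has the highest PageRank
print(max(authorities, key=authorities.get))  # node 3 also has the highest authority
```

The same pattern scales to the full sn_ids.csv network, where the top-ranked IDs play the role of node 3.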
In [61]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import json
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 20)
In [63]:
# Load the config file
with open('config.json', 'r') as f:
    config = json.load(f)

data_loc = config["data_loc"]
file_name = "sn_ids.csv"
In [64]:
sn_df = pd.read_csv(data_loc + file_name)
rows, columns = sn_df.shape
print(f"The dataset contains {rows:,} rows and {columns} columns")
sn_df.head(5)
The dataset contains 289,003 rows and 2 columns
Out[64]:
id_1 id_2
0 0 23977
1 1 34526
2 1 2370
3 1 14683
4 1 29982
  • Apply the PageRank algorithm to compute the rank for each ID in the network. Sort the results in descending order, round the PageRank values to 3 decimals, and create a DataFrame with two columns: "ids" and "PageRank."
In [12]:
import networkx as nx
import matplotlib.pyplot as plt
from sknetwork.data import Bunch 
from sknetwork.ranking import PageRank 
from scipy.sparse import csr_matrix
In [108]:
# Convert the data to a directed graph
G = nx.from_pandas_edgelist(sn_df, 'id_1', 'id_2', create_using=nx.DiGraph)

# Convert the NetworkX graph to a SciPy CSR matrix (sknetwork expects csr_matrix)
adjacency = csr_matrix(nx.to_scipy_sparse_array(G, weight='weight', format='csr'))
names = np.array(list(G.nodes()))
graph = Bunch()
graph.adjacency = adjacency
graph.names = names

# Apply the PageRank algorithm
pagerank = PageRank()
pagerank.fit(adjacency)
scores = pagerank.scores_
scores = [round(score, 3) for score in scores]

# Convert the PageRank scores to a DataFrame
pagerank_df = pd.DataFrame({'ids': names, 'PageRank': scores}).sort_values(by='PageRank', ascending=False).reset_index(drop=True)
pagerank_df.head()
Out[108]:
ids PageRank
0 31890 0.015
1 36652 0.013
2 18163 0.010
3 36628 0.010
4 34114 0.006
  • From the PageRank DataFrame, select the top 5 IDs and use them to filter the original sn_ids.csv based on 'id_1' and 'id_2'. Plot the network graph of the filtered data with figsize(50, 50) and use different colors for "id_1" and "id_2."
In [103]:
# Select the top 5 IDs
top_5_ids = pagerank_df['ids'].head(5).tolist()

# Filter data based on the top 5 IDs
filtered_df = sn_df[(sn_df['id_1'].isin(top_5_ids)) | (sn_df['id_2'].isin(top_5_ids))]
# filtered_df = filtered_df[0:2000]
filtered_df
Out[103]:
id_1 id_2
22 6 31890
38 34957 31890
45 8 36652
69 10 31890
123 11 31890
... ... ...
288973 12628 34114
288976 34114 37535
288977 34114 37431
288978 34114 37460
288979 34114 2730

15631 rows × 2 columns

Plot the Network Graph¶

Network Graph Based on PageRank Analysis (with full filtered data)¶

In [96]:
# Create the directed graph from the filtered data
G = nx.from_pandas_edgelist(filtered_df, 'id_1', 'id_2', create_using=nx.DiGraph())

# Set up plot with figsize of 50x50
plt.figure(figsize=(50, 50))

# Position the nodes using a spring layout
pos = nx.spring_layout(G, k=0.1)

# Draw nodes with different colors for "id_1" and "id_2"
id_1_nodes = filtered_df['id_1'].unique()
id_2_nodes = filtered_df['id_2'].unique()

# Draw "id_1" nodes in sky blue and "id_2" nodes in red
nx.draw_networkx_nodes(G, pos, nodelist=id_1_nodes, node_color='skyblue', node_size=800)
nx.draw_networkx_nodes(G, pos, nodelist=id_2_nodes, node_color='red', node_size=800)

# Draw edges
nx.draw_networkx_edges(G, pos, edge_color='black', width=1)

# Set the plot title
plt.title(f"Network Graph Based on PageRank Score", size=80)
plt.show()

Based on the network graph above, it is hard to derive conclusions and insights due to the high volume of data.

The initial dataset has 289,003 connections. After selecting the top 5 IDs by PageRank score and using them to filter the original sn_ids.csv on 'id_1' and 'id_2', the dataset shrinks to 15,631 connections, a 94.59% reduction in data volume.

To make the graph more interpretable, a further reduction or sampling of the filtered data is necessary. By visualizing a smaller subset, we can better observe the hierarchical or clustered structures around these influential nodes. This refined approach facilitates a clearer understanding of relationships and potential key influencers in the network, offering valuable insights for applications like social network analysis or recommendation systems.
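The reduction figure quoted above can be reproduced directly from the two row counts reported by this notebook:

```python
# Row counts taken from the notebook outputs above
total_edges = 289_003    # rows in sn_ids.csv
filtered_edges = 15_631  # rows after filtering on the top 5 PageRank IDs

reduction = (1 - filtered_edges / total_edges) * 100
print(f"Data volume reduction: {reduction:.2f}%")  # → 94.59%
```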

Network Graph Based on PageRank Analysis (with a subset of filtered data)¶

In [109]:
# Select the top 5 IDs
top_5_ids = pagerank_df['ids'].head(5).tolist()

# Filter data based on the top 5 IDs
filtered_df = sn_df[(sn_df['id_1'].isin(top_5_ids)) | (sn_df['id_2'].isin(top_5_ids))]
filtered_df = filtered_df[0:2000]

# Create the directed graph from the filtered data
G = nx.from_pandas_edgelist(filtered_df, 'id_1', 'id_2', create_using=nx.DiGraph())

# Set up plot with figsize of 50x50
plt.figure(figsize=(50, 50))

# Position the nodes using a spring layout
pos = nx.spring_layout(G, k=0.1)

# Draw nodes with different colors for "id_1" and "id_2"
id_1_nodes = filtered_df['id_1'].unique()
id_2_nodes = filtered_df['id_2'].unique()

# Draw "id_1" nodes in sky blue and "id_2" nodes in red
nx.draw_networkx_nodes(G, pos, nodelist=id_1_nodes, node_color='skyblue', node_size=800)
nx.draw_networkx_nodes(G, pos, nodelist=id_2_nodes, node_color='red', node_size=800)

# Draw edges
nx.draw_networkx_edges(G, pos, edge_color='black', width=1)

# Set the plot title
plt.title(f"Network Graph Based on PageRank Score with a Subset of the Filtered Data", size=40)
plt.show()

This network graph, based on PageRank scores with a subset of filtered data, highlights the influence of key nodes (in red) within the network. These red nodes, positioned centrally, connect to multiple clusters, indicating their role as influential hubs. The structure shows a distinct hub-and-spoke pattern, with central nodes linking to numerous smaller nodes. This setup suggests a hierarchical relationship, where these hubs act as primary connectors, facilitating interaction across different parts of the network. Despite the data reduction, the high connectivity still demonstrates the importance of these influential nodes in maintaining the network's overall structure.
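The hub-and-spoke reading can also be checked numerically instead of visually. This standalone sketch uses a synthetic star graph as a stand-in; in the notebook itself the same degree comparison would run against the `G` built from `filtered_df` and the `top_5_ids` list:

```python
import networkx as nx

# Synthetic stand-in: one high-PageRank hub (id 31890) with 100 spoke nodes
G_demo = nx.DiGraph([(spoke, 31890) for spoke in range(100)])

hub_degree = G_demo.degree(31890)
spoke_degrees = [deg for node, deg in G_demo.degree() if node != 31890]

# A hub-and-spoke structure shows up as a large gap between the two numbers
print(hub_degree, max(spoke_degrees))  # → 100 1
```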

  • Repeat part 1 using the HITS algorithm to compute authority and hub scores. Then, select the top 5 IDs based on authority values and repeat part 2 to filter and plot the network graph.

Compute HITS Algorithm and Extract Hub and Authority Scores¶

In [111]:
# Convert the data to a directed graph
G = nx.from_pandas_edgelist(sn_df, 'id_1', 'id_2', create_using=nx.DiGraph())
H = G.to_directed()

# Apply the HITS algorithm
hubs, authorities = nx.hits(H, max_iter = 50, normalized = True)
ids = list(authorities.keys())
hub_scores = [round(value, 3) for value in hubs.values()]
authorities_scores = [round(value, 3) for value in list(authorities.values())]

# Convert the authority scores into a DataFrame
authority_df = pd.DataFrame({
    'ids': ids, 
    'Hub Score': hub_scores,
    'Authority': authorities_scores     
}).sort_values(by='Authority', ascending=False).reset_index(drop=True)

authority_df.head()
Out[111]:
ids Hub Score Authority
0 31890 0.001 0.010
1 35773 0.001 0.003
2 36652 0.000 0.002
3 19222 0.001 0.002
4 35008 0.000 0.002
In [97]:
# Select the top 5 IDs based on Authority values
top_5_ids = authority_df['ids'].head(5).tolist()

# Filter the original data based on the top 5 Authority IDs
filtered_df = sn_df[(sn_df['id_1'].isin(top_5_ids)) | (sn_df['id_2'].isin(top_5_ids))]
# filtered_df = filtered_df[0:2000]
filtered_df
Out[97]:
id_1 id_2
22 6 31890
28 7 35773
38 34957 31890
45 8 36652
69 10 31890
... ... ...
288795 36652 33051
288796 36652 37649
288797 36652 25233
288798 36652 37672
288799 36652 37562

19647 rows × 2 columns

Plot the Network Graph¶

Network Graph Based on HITS Authority Analysis (with full filtered data)¶

In [98]:
# Create a graph for the filtered data
G_filtered = nx.from_pandas_edgelist(filtered_df, 'id_1', 'id_2', create_using=nx.DiGraph())

# Set up plot with figsize of 50x50
plt.figure(figsize=(50, 50))

# Position the nodes using a spring layout
pos = nx.spring_layout(G_filtered, k=0.1)

# Draw nodes with different colors for "id_1" and "id_2"
id_1_nodes = filtered_df['id_1'].unique()
id_2_nodes = filtered_df['id_2'].unique()

# Draw "id_1" nodes in sky blue and "id_2" nodes in red
nx.draw_networkx_nodes(G_filtered, pos, nodelist=id_1_nodes, node_color='skyblue', node_size=800)
nx.draw_networkx_nodes(G_filtered, pos, nodelist=id_2_nodes, node_color='red', node_size=800)

# Draw edges
nx.draw_networkx_edges(G_filtered, pos, edge_color='black', width=0.5)


# Set the plot title
plt.title(f"Network Graph Based on HITS Authority", size=80)
plt.show()

This HITS-based network graph, like the previous PageRank graph, shows a highly dense structure with numerous connections, making it challenging to derive meaningful insights at this level of complexity. The red nodes likely represent high-authority nodes, surrounded by many other nodes connecting to them.

The structure reveals a layered organization with core nodes in the center and outer nodes connected to them, indicating a potential hierarchical or influential relationship. However, the density of connections obscures specific patterns, highlighting the need for further filtering or sampling to create a more interpretable visualization. By focusing on a representative subset of the data, we can better understand the core connection patterns and influence of high-authority nodes within this network.
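The sampling step below slices the first 2,000 rows. One possible alternative (not what the notebook does) is a random sample, which avoids any ordering bias in the CSV; a minimal sketch on a stand-in frame:

```python
import pandas as pd

# Stand-in for filtered_df; in the notebook, call .sample() on filtered_df itself
demo_df = pd.DataFrame({"id_1": range(10_000), "id_2": range(10_000, 20_000)})

# Positional slice (what the notebook uses) vs. a reproducible random sample
head_subset = demo_df[0:2000]
random_subset = demo_df.sample(n=2000, random_state=42)

print(head_subset.shape, random_subset.shape)  # → (2000, 2) (2000, 2)
```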

Network Graph Based on HITS Authority Analysis (with a subset of filtered data)¶

In [112]:
# Select the top 5 IDs based on Authority values
top_5_ids = authority_df['ids'].head(5).tolist()

# Filter the original data based on the top 5 Authority IDs
filtered_df = sn_df[(sn_df['id_1'].isin(top_5_ids)) | (sn_df['id_2'].isin(top_5_ids))]
filtered_df = filtered_df[0:2000]

# Create a graph for the filtered data
G_filtered = nx.from_pandas_edgelist(filtered_df, 'id_1', 'id_2', create_using=nx.DiGraph())

# Set up plot with figsize of 50x50
plt.figure(figsize=(50, 50))

# Position the nodes using a spring layout
pos = nx.spring_layout(G_filtered, k=0.1)

# Draw nodes with different colors for "id_1" and "id_2"
id_1_nodes = filtered_df['id_1'].unique()
id_2_nodes = filtered_df['id_2'].unique()

# Draw "id_1" nodes in sky blue and "id_2" nodes in red
nx.draw_networkx_nodes(G_filtered, pos, nodelist=id_1_nodes, node_color='skyblue', node_size=800)
nx.draw_networkx_nodes(G_filtered, pos, nodelist=id_2_nodes, node_color='red', node_size=800)

# Draw edges
nx.draw_networkx_edges(G_filtered, pos, edge_color='black', width=0.5)


# Set the plot title
plt.title(f"Network Graph Based on HITS Authority", size=80)
plt.show()

This network graph of 2,000 connections highlights 5 high-authority nodes (in red) with dense clusters of blue nodes connected to them. The red nodes demonstrate strong influence, attracting multiple connections. Some blue nodes link to more than one high-authority node, acting as bridges across clusters and enhancing interconnectivity. This clearer view emphasizes the role of high-authority nodes in structuring the network and connecting communities.
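The bridge observation can be made precise: call a node a bridge if it connects to at least two of the five authority IDs. A self-contained sketch on a toy edge list (in the notebook, substitute `filtered_df` and `top_5_ids`):

```python
import pandas as pd

# Toy edge list: node 7 links to both authority nodes, nodes 8 and 9 to one each
edges = pd.DataFrame({"id_1": [7, 7, 8, 9], "id_2": [100, 200, 100, 200]})
authority_ids = {100, 200}

# For each source node, count how many distinct authority nodes it reaches
links = edges[edges["id_2"].isin(authority_ids)]
authority_counts = links.groupby("id_1")["id_2"].nunique()
bridges = authority_counts[authority_counts >= 2].index.tolist()

print(bridges)  # → [7]
```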

Disclaimer: Problems 2 and 3 are for learning purposes only and not financial advice.¶

Problem 2.¶

Do the following using the Yahoo Finance package. As usual, write the analysis details and explain all that you do for each part.

  • Download the data for the 20 ticker symbols. Create a DataFrame with 3 columns: Ticker Symbol, Top 10 institutional holders of each ticker, and the corresponding holding amounts in dollars.
  • Using the DataFrame, build a network graph where the institutional holders are the source and ticker symbols are the target. Label the nodes and use different colors for the source (holders) and the target (tickers). Adjust the edge thickness based on normalized holding amounts, and scale the size of ticker symbols by their degree.
  • Modify the graph from part 2 to improve its appearance by making at least one change—such as color, layout, or size—so that the graph looks better in your view.
In [147]:
import pandas as pd
import yfinance as yf
import numpy as np
import requests
In [144]:
top20_tickers = ["AAPL", "AMZN", "MSFT", "GOOG", "GOOGL", "META", "TSLA", "NVDA", "JPM", "JNJ", "V", "PG",
                 "UNH", "HD", "MA", "BAC", "DIS", "PYPL", "NFLX", "ADBE"]
  • Download the data for the 20 ticker symbols. Create a DataFrame with 3 columns: Ticker Symbol, Top 10 institutional holders of each ticker, and the corresponding holding amounts in dollars.
In [146]:
# Initialize an empty DataFrame to store results
all_holders = pd.DataFrame()

for ticker in top20_tickers:
    stock = yf.Ticker(ticker)
    institutional_holders = stock.institutional_holders
    # Skip tickers for which yfinance returns no institutional-holder data
    if institutional_holders is None or institutional_holders.empty:
        continue
    institutional_holders['Ticker'] = ticker
    all_holders = pd.concat([all_holders, institutional_holders.head(10)], ignore_index=True)

all_holders = all_holders[['Ticker', 'Holder', 'Value']]
rows, columns = all_holders.shape
print(f"The dataset contains {rows:,} rows and {columns} columns")
all_holders.head(5)
The dataset contains 200 rows and 3 columns
Out[146]:
Ticker Holder Value
0 AAPL Vanguard Group Inc 252876459508
1 AAPL Blackrock Inc. 201659137420
2 AAPL Berkshire Hathaway, Inc 177591247296
3 AAPL State Street Corporation 112288817516
4 AAPL FMR, LLC 59561715772
  • Using the DataFrame, build a network graph where the institutional holders are the source and ticker symbols are the target. Label the nodes and use different colors for the source (holders) and the target (tickers). Adjust the edge thickness based on normalized holding amounts, and scale the size of ticker symbols by their degree.
In [ ]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize
from matplotlib.cm import ScalarMappable
from IPython.display import SVG 
from sknetwork.visualization import svg_graph 
from sknetwork.data import Bunch 
from sknetwork.ranking import PageRank
from scipy.sparse import csr_matrix
In [272]:
holder_color = "skyblue"
ticker_color = "red"
holder_size = 1000

# Normalize holding values for edge thickness
all_holders['Normalized Value'] = all_holders['Value'] / all_holders['Value'].max()

# Create directed graph
G = nx.from_pandas_edgelist(all_holders, 'Holder', 'Ticker', ['Value'], create_using=nx.DiGraph())

# Set up plot with figsize of 40x30
plt.figure(figsize=(40, 30))

# Draw nodes
holders = [node for node in G.nodes() if node in all_holders['Holder'].values]
tickers = [node for node in G.nodes() if node in all_holders['Ticker'].values]

# Define positions using spring layout for a more organic structure
pos = nx.spring_layout(G, k=0.3, seed=42)

# Scale ticker sizes by degree (number of connections)
ticker_sizes = [G.degree(ticker) * 500 for ticker in tickers]

# Draw holder nodes in sky blue and ticker nodes in red
nx.draw_networkx_nodes(G, pos, nodelist=tickers, node_color=ticker_color, node_size=ticker_sizes)
nx.draw_networkx_nodes(G, pos, nodelist=holders, node_color=holder_color, node_size=holder_size)

# Draw labels for holders slightly outside the nodes
holder_labels = {node: node for node in holders}
holder_label_pos = {node: (pos[node][0], pos[node][1] + 0.05) for node in holders}  # Offset for visibility
nx.draw_networkx_labels(G, holder_label_pos, labels=holder_labels, font_size=25, verticalalignment="bottom")

# Draw edges with thickness based on normalized holding values
edge_widths = [all_holders.loc[(all_holders['Holder'] == u) & (all_holders['Ticker'] == v), 'Normalized Value'].values[0] * 5 for u, v in G.edges()]
nx.draw_networkx_edges(G, pos, edge_color='black', width=edge_widths)

# Draw labels for tickers inside the nodes
ticker_labels = {node: node for node in tickers}
nx.draw_networkx_labels(G, pos, labels=ticker_labels, font_size=20, font_color="white")

# Create a legend
plt.scatter([], [], c=holder_color, label='Institutional Holders', s=400)
plt.scatter([], [], c=ticker_color, label='Ticker Symbols', s=400)
plt.legend(scatterpoints=1, frameon=False, labelspacing=1, loc='upper right', fontsize=30)

# Set the plot title
plt.title(f"Institutional Ownership Network", size=50)
plt.show()
  • Modify the graph from part 2 to improve its appearance by making at least one change—such as color, layout, or size—so that the graph looks better in your view.
In [276]:
holder_color = "skyblue"
ticker_color = "red"
holder_size = 1000

# Normalize holding values for edge thickness
all_holders['Normalized Value'] = all_holders['Value'] / all_holders['Value'].max()

# Create directed graph
G = nx.from_pandas_edgelist(all_holders, 'Holder', 'Ticker', ['Value'], create_using=nx.DiGraph())

# Set up plot with figsize of 40x30
plt.figure(figsize=(40, 30))

# Draw nodes
holders = [node for node in G.nodes() if node in all_holders['Holder'].values]
tickers = [node for node in G.nodes() if node in all_holders['Ticker'].values]

# Use shell layout to position tickers in the center and holders outside
pos = nx.shell_layout(G, nlist=[tickers, holders])

# Scale ticker sizes by degree (number of connections)
ticker_sizes = [G.degree(ticker) * 500 for ticker in tickers]

# Draw holder nodes in sky blue and ticker nodes in red
nx.draw_networkx_nodes(G, pos, nodelist=tickers, node_color=ticker_color, node_size=ticker_sizes)
nx.draw_networkx_nodes(G, pos, nodelist=holders, node_color=holder_color, node_size=holder_size)

# Draw labels for holders slightly outside the nodes
holder_labels = {node: node for node in holders}
holder_label_pos = {node: (pos[node][0], pos[node][1] + 0.04) for node in holders}  # Offset for visibility
nx.draw_networkx_labels(G, holder_label_pos, labels=holder_labels, font_size=25, verticalalignment="bottom")

# Draw edges with thickness based on normalized holding values
edge_widths = [all_holders.loc[(all_holders['Holder'] == u) & (all_holders['Ticker'] == v), 'Normalized Value'].values[0] * 5 for u, v in G.edges()]
nx.draw_networkx_edges(G, pos, edge_color='black', width=edge_widths)

# Draw labels for tickers inside the nodes
ticker_labels = {node: node for node in tickers}
nx.draw_networkx_labels(G, pos, labels=ticker_labels, font_size=20, font_color="white")

# Create a legend
plt.scatter([], [], c=holder_color, label='Institutional Holders', s=400)
plt.scatter([], [], c=ticker_color, label='Ticker Symbols', s=400)
plt.legend(scatterpoints=1, frameon=False, labelspacing=1, loc='upper right', fontsize=30)
plt.title(f"Institutional Ownership Network: Central Stock Tickers and Outer Institutional Holders", size=45)
plt.show()

This network visualization highlights the relationships between major institutional holders (in blue) and top stock tickers (in red). By centralizing the stock tickers and placing institutional holders around them, the graph clearly illustrates how multiple institutions are connected to popular stocks. The varying edge thickness, based on normalized holding amounts, provides a visual cue of the strength of each institutional investment in a particular stock. The ticker node sizes do not vary much because every ticker has exactly 10 institutional holders. This layout emphasizes the most influential stocks in attracting significant institutional investment and the diverse range of institutions supporting these key assets.
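The observation about near-uniform ticker sizes follows from the data itself: each ticker contributes exactly ten (Holder, Ticker) edges, so every ticker's in-degree, and hence its plotted size, is the same unless holder names repeat. A toy illustration (swap in `all_holders` in the notebook):

```python
import pandas as pd

# Toy stand-in for all_holders: two tickers, each with the same three holders
demo = pd.DataFrame({
    "Holder": ["Vanguard", "Blackrock", "State Street"] * 2,
    "Ticker": ["AAPL"] * 3 + ["MSFT"] * 3,
})

# In-degree of each ticker = number of distinct holders linking to it
in_degree = demo.groupby("Ticker")["Holder"].nunique()
print(in_degree.nunique() == 1)  # → True: all tickers share the same in-degree
```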

Problem 3.¶

Do the following and write the findings from your analysis.

  • Use web scraping to collect 25 news articles or titles for each of the 20 stocks listed in the previous problem. Organize the information into a DataFrame with three columns: Date, Journalist, and Article content (or titles if the full article is not accessible). This approach ensures you can proceed even if gathering full articles is challenging.
  • Perform sentiment analysis on the articles collected in part 1 using TextBlob. Sort the resulting DataFrame by sentiment scores to rank the articles by sentiment positivity.
  • Repeat the sentiment analysis using a Naive Bayes classifier. Compare the results from TextBlob and Naive Bayes to select 10 stocks for your portfolio based on the most positive sentiment trends.
In [6]:
import newspaper
from newspaper import Article 
from tqdm import tqdm 
import wikipedia as wiki 
from GoogleNews import GoogleNews
import pandas as pd
  • Use web scraping to collect 25 news articles or titles for each of the 20 stocks listed in the previous problem. Organize the information into a DataFrame with three columns: Date, Journalist, and Article content (or titles if the full article is not accessible). This approach ensures you can proceed even if gathering full articles is challenging.
In [2]:
from datetime import datetime, timedelta
import re
import time

# Function to parse relative dates
def parse_relative_date(relative_date_str):
    now = datetime.now()
    
    if 'hour' in relative_date_str:
        hours = int(re.search(r'\d+', relative_date_str).group())
        return now - timedelta(hours=hours)
    elif 'minute' in relative_date_str:
        minutes = int(re.search(r'\d+', relative_date_str).group())
        return now - timedelta(minutes=minutes)
    elif 'day' in relative_date_str:
        days = int(re.search(r'\d+', relative_date_str).group())
        return now - timedelta(days=days)
    elif 'week' in relative_date_str:
        weeks = int(re.search(r'\d+', relative_date_str).group())
        return now - timedelta(weeks=weeks)
    else:
        return now  # If date format is not recognized, default to now
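The parser above can be exercised deterministically. This standalone check restates its logic in compact form and adds a hypothetical `now` parameter (not in the cell above) so the result does not depend on the wall clock:

```python
from datetime import datetime, timedelta
import re

def parse_relative_date(relative_date_str, now=None):
    """Compact restatement of the parser above; 'now' is injectable for testing."""
    now = now or datetime.now()
    match = re.search(r"\d+", relative_date_str)
    n = int(match.group()) if match else 0
    for unit in ("minute", "hour", "day", "week"):
        if unit in relative_date_str:
            return now - timedelta(**{unit + "s": n})
    return now  # unrecognized format: default to now

fixed_now = datetime(2024, 11, 4, 12, 0, 0)
print(parse_relative_date("2 hours ago", now=fixed_now))  # → 2024-11-04 10:00:00
print(parse_relative_date("3 days ago", now=fixed_now))   # → 2024-11-01 12:00:00
```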
In [7]:
# Define list of stock tickers
top20_tickers = ["AAPL", "AMZN", "MSFT", "GOOG", "GOOGL", "META", "TSLA", "NVDA", "JPM", "JNJ", 
                 "V", "PG", "UNH", "HD", "MA", "BAC", "DIS", "PYPL", "NFLX", "ADBE"]
# top20_tickers = ["AAPL", "AMZN"]

# Define date range for recent articles
googlenews = GoogleNews(lang='en')
googlenews.set_encode('utf-8')

# Initialize an empty list to store article data
articles_data = []

for ticker in tqdm(top20_tickers, desc="Fetching articles for stocks"):

    # Search for news articles related to the ticker
    googlenews.search(ticker)

    # Collect articles from the first pages
    news_results = []
    unique_titles = set()

    for page in range(1, 10):  # Retrieve the first N pages
        googlenews.getpage(page)
        page_results = googlenews.result()
        time.sleep(2) 
        # Filter duplicates by title
        for news in page_results:
            title = news.get('title', None)
            
            # Check if title is unique and we haven't reached 25 articles
            if title and title not in unique_titles:
                unique_titles.add(title)  # Add title to the set of seen titles
                news_results.append(news)  # Add the unique article to results

            # Stop once we have 25 unique articles
            if len(news_results) >= 25:
                break
        if len(news_results) >= 25:
            break
    # print(f"News Length: {len(news_results)}")

    # Loop through each news result
    for news in news_results:
        relative_date = news.get('date', None)
        title = news.get('title', None)
        link = news.get('link', None)
        media = news.get('media', None)

        # Convert relative date to actual datetime
        actual_date = parse_relative_date(relative_date) if relative_date else None
        
        # Format the URL
        if "&" in link:
            link = link.split('&')[0]

        # Try to fetch the journalist and content
        journalist = None
        content = None
        try:
            article = Article(link)
            article.download()
            article.parse()
            journalist = article.authors if article.authors else None
            content = article.text if article.text else title
        except Exception:
            content = title  # Use title if full article is not accessible
        
        # Append data to the list
        articles_data.append({
            'Ticker': ticker,
            'Relative Date': relative_date,
            'Date': actual_date.strftime('%Y-%m-%d %H:%M:%S') if actual_date else 'N/A',
            "Media": media,
            'Journalist': ', '.join(journalist) if journalist else 'N/A',
            'Article Content': content,
            'Article Link': link
        })

        # Clear GoogleNews search
        googlenews.clear()

# Convert the list to a DataFrame
articles_df = pd.DataFrame(articles_data)

rows, columns = articles_df.shape
print(f"The dataset contains {rows:,} rows and {columns} columns")
articles_df.head(5)
The dataset contains 500 rows and 7 columns
Out[7]:
Ticker Relative Date Date Media Journalist Article Content Article Link
0 AAPL 1 hour ago 2024-11-04 06:43:12 Simply Wall Street N/A Apple ( ) Full Year 2024 Results\n\nKey Financ... https://simplywall.st/stocks/us/tech/nasdaq-aa...
1 AAPL 1 hour ago 2024-11-04 06:43:13 Benzinga David Pinsen Remembering The Ultimate Example Of DEI\n\nIn ... https://www.benzinga.com/markets/24/11/4171071...
2 AAPL 2 hours ago 2024-11-04 05:43:13 FXLeaders Skerdian Meta aapl-usd\n\nStock markets including Apple stoc... https://www.fxleaders.com/news/2024/11/04/look...
3 AAPL 2 hours ago 2024-11-04 05:43:14 StreetInsider N/A BofA Securities Reiterates Buy Rating on Apple... https://www.streetinsider.com/Analyst%2BCommen...
4 AAPL 5 hours ago 2024-11-04 02:43:15 Defense World Defense World Staff Wealth Dimensions Group Ltd. lifted its stake ... https://www.defenseworld.net/2024/11/04/wealth...

Export DataFrame¶

In [9]:
# Load the config file
with open('config.json', 'r') as f:
    config = json.load(f)

data_loc = config["data_loc"]
file_name = "stocks_news_articles.csv"

file_destination = data_loc + file_name

articles_df.to_csv(file_destination, index=False)
  • Perform sentiment analysis on the articles collected in part 1 using TextBlob. Sort the resulting DataFrame by sentiment scores to rank the articles by sentiment positivity.
In [15]:
import matplotlib.pyplot as plt
from textblob import TextBlob
In [16]:
# articles_dff = pd.read_csv(file_destination)
# articles_dff = articles_dff.fillna(" ")
# articles_dff.head()
In [51]:
# Perform sentiment analysis
articles_df['Sentiment Score'] = articles_df['Article Content'].apply(lambda x: TextBlob(x).sentiment.polarity)

# Sort the DataFrame by sentiment score in descending order (most positive first)
articles_df = articles_df.sort_values(by='Sentiment Score', ascending=False).reset_index(drop=True)

# Display the sorted DataFrame
articles_df
Out[51]:
Ticker Relative Date Date Media Journalist Article Content Article Link Sentiment Score
0 META 2 days ago 2024-11-02 07:45:12 YouTube N/A Magnificent Meta and Microsoft https://www.youtube.com/watch%3Fv%3DTB_dDBD-DGQ 1.000000
1 MSFT 2 days ago 2024-11-02 07:44:14 Seeking Alpha N/A Microsoft: Most Magnificent, Fairly Valued (NA... https://seekingalpha.com/article/4731992-micro... 0.733333
2 V 0 hours ago 2024-11-04 07:47:15 Business Standard Business Standard US election showdown: Latest polls show Harris... https://www.business-standard.com/world-news/u... 0.500000
3 NVDA 2 minutes ago 2024-11-04 07:43:50 Barron's N/A Nvidia, Apple, Sherwin-Williams, DJT, Talen En... https://www.barrons.com/articles/stock-market-... 0.500000
4 BAC 1 day ago 2024-11-03 07:49:47 MSN N/A Buffett's Berkshire Hathaway cuts Apple, BofA ... https://www.msn.com/en-us/money/topstocks/buff... 0.500000
... ... ... ... ... ... ... ... ...
495 META 2 days ago 2024-11-02 07:45:10 Seeking Alpha N/A Meta: AI Train Isn't Slowing Down Anytime Soon https://seekingalpha.com/article/4732314-meta-... -0.155556
496 AMZN 6 minutes ago 2024-11-04 07:37:40 Barron's N/A Talen Stock Tumbles on Amazon Nuclear Power Se... https://www.barrons.com/articles/talen-stock-p... -0.155556
497 PG 1 day ago 2024-11-03 07:47:48 Jagran English N/A NEET PG Counselling 2024 Schedule: The Medical... https://english.jagran.com/education/neet-pg-c... -0.163636
498 DIS 6 hours ago 2024-11-04 01:50:16 Tech in Asia N/A If you're seeing this message, that means Java... https://www.techinasia.com/news/disney-forms-b... -0.200000
499 PYPL 3 days ago 2024-11-01 07:50:46 Seeking Alpha N/A PayPal: Just 3 Million Shy Of Making Another R... https://seekingalpha.com/article/4731471-paypa... -0.500000

500 rows × 8 columns

Sentiment Analysis¶

Summary Statistics of Sentiment Scores¶

In [52]:
# Summary statistics
print("Summary Statistics of Sentiment Scores:")
print("Mean Sentiment Score:", articles_df['Sentiment Score'].mean())
print("Median Sentiment Score:", articles_df['Sentiment Score'].median())
print("Minimum Sentiment Score:", articles_df['Sentiment Score'].min())
print("Maximum Sentiment Score:", articles_df['Sentiment Score'].max())
Summary Statistics of Sentiment Scores:
Mean Sentiment Score: 0.09884682181507255
Median Sentiment Score: 0.07927035330261134
Minimum Sentiment Score: -0.5
Maximum Sentiment Score: 1.0

Overall Sentiment:¶

The mean sentiment score is 0.099 and the median is 0.079, indicating a slightly positive overall sentiment.

The sentiment scores range from a minimum of -0.5 (most negative) to a maximum of 1 (most positive).

Distribution of Sentiment Scores¶

In [53]:
# Histogram of sentiment scores
plt.figure(figsize=(10, 6))
plt.hist(articles_df['Sentiment Score'], bins=20, color='skyblue', edgecolor='black')
plt.title("Distribution of Sentiment Scores")
plt.xlabel("Sentiment Score")
plt.ylabel("Number of Articles")
plt.show()

The histogram of sentiment scores shows a clustering around slightly positive scores, with fewer articles exhibiting high positivity or negativity. This distribution suggests that most articles have a balanced tone, with only a few articles at sentiment extremes.

Categorize Articles by Sentiment¶

In [54]:
# Categorize articles by sentiment
articles_df['Sentiment Category'] = articles_df['Sentiment Score'].apply(
    lambda x: 'Positive' if x > 0 else ('Negative' if x < 0 else 'Neutral')
)

# Count the number of articles in each sentiment category
category_counts = articles_df['Sentiment Category'].value_counts()
print("\nNumber of Articles by Sentiment Category:")
print(category_counts)
Number of Articles by Sentiment Category:
Sentiment Category
Positive    404
Neutral      53
Negative     43
Name: count, dtype: int64

  • Positive articles: the majority of articles (404) carry positive sentiment.
  • Neutral articles: a notable portion (53) is neutral, reflecting balanced coverage.
  • Negative articles: a smaller number (43) are negative, indicating that negative sentiment is the least common.
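The row-wise `.apply` used above is fine at 500 rows; on larger frames the same three-way split can be done vectorized with `np.select`. A sketch on toy scores:

```python
import numpy as np
import pandas as pd

# Toy scores; in the notebook this would be articles_df['Sentiment Score']
df = pd.DataFrame({"Sentiment Score": [0.5, 0.0, -0.2, 0.1]})

# np.select evaluates each condition once over the whole column
conditions = [df["Sentiment Score"] > 0, df["Sentiment Score"] < 0]
df["Sentiment Category"] = np.select(conditions, ["Positive", "Negative"], default="Neutral")

print(df["Sentiment Category"].tolist())  # ['Positive', 'Neutral', 'Negative', 'Positive']
```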

Distribution of sentiment categories for each stock¶

In [55]:
# Group data by Ticker and Sentiment Category
sentiment_counts = articles_df.groupby(['Ticker', 'Sentiment Category']).size().unstack(fill_value=0).sort_values(by=['Negative', 'Neutral'], ascending=False)

# Plot a stacked bar chart
sentiment_counts.plot(kind='bar', stacked=True, figsize=(12, 8), color=['darkred', 'yellow', 'green'])
plt.title('Sentiment Distribution by Ticker')
plt.xlabel('Ticker')
plt.ylabel('Count of Articles')
plt.legend(title='Sentiment Category')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

This stacked bar chart illustrates the distribution of sentiment across different stock tickers. Notably, JPM, BAC, MA, and TSLA have a more significant proportion of negative (red) and neutral (yellow) sentiments, while many other stocks are predominantly positive (green). This trend may indicate that JPM, BAC, and TSLA are currently experiencing more mixed or negative media coverage, whereas stocks like GOOGL, ADBE, and META are almost entirely positive, suggesting a generally favorable sentiment around these tickers. This insight can be useful for identifying stocks that might face sentiment-driven volatility or positive momentum.

Top Positive and Negative Articles¶

In [56]:
# Top 5 positive articles (articles_df is already sorted by Sentiment Score, descending)
top_positive = articles_df.head(5)
print("\nTop 5 Positive Articles:")
top_positive[['Ticker', 'Date', 'Article Content', 'Sentiment Score', 'Article Link']]
Top 5 Positive Articles:
Out[56]:
Ticker Date Article Content Sentiment Score Article Link
0 META 2024-11-02 07:45:12 Magnificent Meta and Microsoft 1.000000 https://www.youtube.com/watch%3Fv%3DTB_dDBD-DGQ
1 MSFT 2024-11-02 07:44:14 Microsoft: Most Magnificent, Fairly Valued (NA... 0.733333 https://seekingalpha.com/article/4731992-micro...
2 V 2024-11-04 07:47:15 US election showdown: Latest polls show Harris... 0.500000 https://www.business-standard.com/world-news/u...
3 NVDA 2024-11-04 07:43:50 Nvidia, Apple, Sherwin-Williams, DJT, Talen En... 0.500000 https://www.barrons.com/articles/stock-market-...
4 BAC 2024-11-03 07:49:47 Buffett's Berkshire Hathaway cuts Apple, BofA ... 0.500000 https://www.msn.com/en-us/money/topstocks/buff...
In [57]:
# Top 5 negative articles (the tail of the score-sorted DataFrame)
top_negative = articles_df.tail(5)
print("\nTop 5 Negative Articles:")
top_negative[['Ticker', 'Date', 'Article Content', 'Sentiment Score', 'Article Link']]
Top 5 Negative Articles:
Out[57]:
Ticker Date Article Content Sentiment Score Article Link
495 META 2024-11-02 07:45:10 Meta: AI Train Isn't Slowing Down Anytime Soon -0.155556 https://seekingalpha.com/article/4732314-meta-...
496 AMZN 2024-11-04 07:37:40 Talen Stock Tumbles on Amazon Nuclear Power Se... -0.155556 https://www.barrons.com/articles/talen-stock-p...
497 PG 2024-11-03 07:47:48 NEET PG Counselling 2024 Schedule: The Medical... -0.163636 https://english.jagran.com/education/neet-pg-c...
498 DIS 2024-11-04 01:50:16 If you're seeing this message, that means Java... -0.200000 https://www.techinasia.com/news/disney-forms-b...
499 PYPL 2024-11-01 07:50:46 PayPal: Just 3 Million Shy Of Making Another R... -0.500000 https://seekingalpha.com/article/4731471-paypa...

Key Takeaways¶

  • GOOGL and ADBE are currently viewed favorably in the media, as evidenced by their predominantly positive article counts in the stacked chart.
  • JPM and BAC receive the most negative coverage, which may weigh on public perception or investor sentiment toward these companies.
  • Overall, the news sentiment is slightly positive, but the presence of neutral articles suggests a balanced representation of information.

  • Repeat the sentiment analysis using a Naive Bayes classifier. Compare the results from TextBlob and Naive Bayes to select 10 stocks for your portfolio based on the most positive sentiment trends.

Train a Naive Bayes Classifier¶

In [71]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from imblearn.over_sampling import RandomOverSampler
In [72]:
articles_df.head(1)
Out[72]:
Ticker Relative Date Date Media Journalist Article Content Article Link Sentiment Score Sentiment Category
0 META 2 days ago 2024-11-02 07:45:12 YouTube N/A Magnificent Meta and Microsoft https://www.youtube.com/watch%3Fv%3DTB_dDBD-DGQ 1.0 Positive

Prepare data for modeling¶

In [ ]:
# Convert article content into feature vectors
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(articles_df['Article Content'])

# Use the TextBlob-labeled sentiment as the target variable
y = articles_df['Sentiment Category']

# Split the data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
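`CountVectorizer` turns each headline into a bag-of-words count vector, dropping English stop words. A minimal sketch of what `fit_transform` produces, on two toy headlines:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["stock rallies on strong earnings",
        "stock falls on weak earnings"]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # sparse matrix: rows = docs, cols = terms

# 'on' is removed as a stop word; the remaining terms form the vocabulary
print(sorted(vectorizer.vocabulary_))
print(X.toarray())  # term counts per document
```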

Default MultinomialNB¶

In [97]:
# Train a Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)

# Evaluate the classifier on the test set
y_pred = nb_classifier.predict(X_test)
print("Classification Report for Naive Bayes Classifier:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
Classification Report for Naive Bayes Classifier:
               precision    recall  f1-score   support

    Negative       0.23      0.25      0.24        12
     Neutral       0.80      0.33      0.47        12
    Positive       0.83      0.89      0.86        76

    accuracy                           0.75       100
   macro avg       0.62      0.49      0.52       100
weighted avg       0.75      0.75      0.74       100

Accuracy Score: 0.75
  • With an accuracy of 0.75, the default Naive Bayes classifier performs well for the positive class but struggles with negative and neutral classes due to the class imbalance.
  • This leads to high recall and f1-score for positive sentiment but poor results for minority classes.
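A confusion matrix makes those per-class failures explicit in a way a single accuracy number cannot; a sketch on synthetic labels that mimic the imbalance (a classifier drifting toward the majority class):

```python
from sklearn.metrics import confusion_matrix

# Synthetic labels: minority-class articles often get predicted 'Positive'
y_true = ["Positive"] * 8 + ["Neutral"] * 2 + ["Negative"] * 2
y_pred = ["Positive"] * 8 + ["Positive", "Neutral"] + ["Positive", "Negative"]

labels = ["Negative", "Neutral", "Positive"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # rows = true class, columns = predicted class
```

Off-diagonal mass in the Negative and Neutral rows is exactly the weakness the classification report's low minority recall is flagging.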

Alpha Adjustment (Alpha = 0.5):¶

In [98]:
# Adjust alpha (smoothing parameter)
nb_classifier = MultinomialNB(alpha=0.5) 
nb_classifier.fit(X_train, y_train)
y_pred = nb_classifier.predict(X_test)
print("Classification Report for Naive Bayes Classifier with Alpha Adjustment:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
Classification Report for Naive Bayes Classifier with Alpha Adjustment:
               precision    recall  f1-score   support

    Negative       0.27      0.25      0.26        12
     Neutral       0.60      0.50      0.55        12
    Positive       0.85      0.88      0.86        76

    accuracy                           0.76       100
   macro avg       0.57      0.54      0.56       100
weighted avg       0.75      0.76      0.75       100

Accuracy Score: 0.76
  • Adjusting the smoothing parameter alpha improved the accuracy slightly to 0.76.
  • However, the impact on the minority classes was limited, showing only minor gains.
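Rather than trying one hand-picked alpha, the smoothing parameter can be cross-validated over a grid; a sketch on synthetic count data (shapes and values are illustrative, not the homework data):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Synthetic term-count matrix: 60 documents x 10 terms, two classes
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(60, 10))
y = rng.integers(0, 2, size=60)

# 3-fold cross-validation over candidate alpha values
grid = GridSearchCV(MultinomialNB(), {"alpha": [0.1, 0.5, 1.0, 2.0]}, cv=3)
grid.fit(X, y)
print("Best alpha:", grid.best_params_["alpha"])
```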

Oversampling Minority Classes¶

In [125]:
# Oversample the training set to balance classes
ros = RandomOverSampler(random_state=42)
X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)

# Train and evaluate with resampled data
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_resampled, y_train_resampled)
y_pred = nb_classifier.predict(X_test)
print("Classification Report with Oversampling:\n", classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
Classification Report with Oversampling:
               precision    recall  f1-score   support

    Negative       0.56      0.42      0.48        12
     Neutral       0.53      0.67      0.59        12
    Positive       0.91      0.91      0.91        76

    accuracy                           0.82       100
   macro avg       0.67      0.66      0.66       100
weighted avg       0.82      0.82      0.82       100

Accuracy Score: 0.82
  • By oversampling, the accuracy increased significantly to 0.82.
  • This approach had the most notable improvement on negative and neutral classes, with a boost in both precision and recall for these categories.
  • The model became more balanced across all sentiment classes, making it the most effective solution among the three approaches.
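`RandomOverSampler` simply duplicates minority-class rows (sampling with replacement) until every class matches the majority count; the same effect can be sketched with pandas alone, which is handy if `imblearn` is unavailable:

```python
import pandas as pd

# Imbalanced toy training labels
df = pd.DataFrame({"doc_id": range(10),
                   "label": ["Positive"] * 7 + ["Neutral"] * 2 + ["Negative"]})

# Resample each class (with replacement) up to the majority-class size
max_count = df["label"].value_counts().max()
balanced = pd.concat(g.sample(max_count, replace=True, random_state=42)
                     for _, g in df.groupby("label"))

print(balanced["label"].value_counts())  # every class now has 7 rows
```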

Use the best model to predict sentiment on the full dataset¶

In [100]:
# Predict sentiment on the full dataset
articles_df['Naive Bayes Sentiment'] = nb_classifier.predict(vectorizer.transform(articles_df['Article Content']))

# Count the distribution of Naive Bayes sentiment predictions
print("\nNaive Bayes Sentiment Distribution:")
print(articles_df['Naive Bayes Sentiment'].value_counts())
Naive Bayes Sentiment Distribution:
Naive Bayes Sentiment
Positive    391
Neutral      62
Negative     47
Name: count, dtype: int64

Compare Results from TextBlob and Naive Bayes¶

In [126]:
# Compare TextBlob and Naive Bayes sentiment labels
# .copy() avoids SettingWithCopyWarning; 'Agreement' is created below, so it cannot be selected here
comparison_df = articles_df[["Ticker", "Date", "Media", "Article Content", "Article Link", "Sentiment Score", "Sentiment Category", "Naive Bayes Sentiment"]].copy()
comparison_df['Agreement'] = comparison_df['Sentiment Category'] == comparison_df['Naive Bayes Sentiment']
agreement_percentage = comparison_df['Agreement'].mean() * 100

print(f"\nAgreement between TextBlob and Naive Bayes: {agreement_percentage:.2f}%")
print(f"Discrepancies: {(~comparison_df['Agreement']).sum()} out of {comparison_df.shape[0]}")  # count of disagreement rows
comparison_df[comparison_df['Agreement'] == False].head()
Agreement between TextBlob and Naive Bayes: 92.60%
Discrepancies: 37 out of 500
Out[126]:
Ticker Date Media Article Content Article Link Sentiment Score Sentiment Category Naive Bayes Sentiment Agreement
3 NVDA 2024-11-04 07:43:50 Barron's Nvidia, Apple, Sherwin-Williams, DJT, Talen En... https://www.barrons.com/articles/stock-market-... 0.5 Positive Neutral False
4 BAC 2024-11-03 07:49:47 MSN Buffett's Berkshire Hathaway cuts Apple, BofA ... https://www.msn.com/en-us/money/topstocks/buff... 0.5 Positive Negative False
5 AAPL 2024-11-04 04:43:26 Seeking Alpha Apple: No Reason For The Love (NASDAQ:AAPL) https://seekingalpha.com/article/4732532-apple... 0.5 Positive Negative False
11 PG 2024-11-04 07:47:45 Republic World Diljit Dosanjh Jaipur Concert: PG Students Sav... https://www.republicworld.com/entertainment/ce... 0.4 Positive Neutral False
13 HD 2024-11-04 04:48:38 Offshore Energy HD HHI kicks off autonomous ship demonstration... https://www.offshore-energy.biz/hd-hhi-kicks-o... 0.4 Positive Neutral False
  • TextBlob and Naive Bayes agree on 92.60% of articles, indicating that the two approaches classify sentiment consistently.
  • That agreement leaves 37 discrepancies out of 500 comparisons.
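Beyond the overall agreement rate, a cross-tabulation shows *where* the two methods diverge (e.g. TextBlob-Positive articles that Naive Bayes calls Neutral); a sketch on toy labels:

```python
import pandas as pd

# Toy labels from the two sentiment methods
textblob = pd.Series(["Positive", "Positive", "Neutral", "Negative", "Positive"])
naive_bayes = pd.Series(["Positive", "Neutral", "Neutral", "Negative", "Positive"])

# Rows: TextBlob label, columns: Naive Bayes label
ct = pd.crosstab(textblob, naive_bayes, rownames=["TextBlob"], colnames=["Naive Bayes"])
print(ct)

agreement = (textblob == naive_bayes).mean() * 100
print(f"Agreement: {agreement:.1f}%")  # 80.0% on this toy data
```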

Select 10 Stocks for the Portfolio Based on Positive Sentiment Trends¶

In [110]:
# Calculate positive sentiment proportions for each stock based on Naive Bayes results
positive_sentiment_df = articles_df[articles_df['Naive Bayes Sentiment'] == 'Positive']
positive_proportions = positive_sentiment_df['Ticker'].value_counts() / articles_df['Ticker'].value_counts()
positive_proportions = positive_proportions.dropna().sort_values(ascending=False)

# Select top 10 stocks for portfolio
top_10_stocks = positive_proportions.head(10)
print("\nTop 10 Stocks for Portfolio based on Positive Sentiment Trends:")
print(top_10_stocks)
Top 10 Stocks for Portfolio based on Positive Sentiment Trends:
Ticker
GOOGL    0.96
PYPL     0.96
ADBE     0.96
NFLX     0.96
MSFT     0.92
META     0.88
AMZN     0.84
UNH      0.84
DIS      0.84
JNJ      0.84
Name: count, dtype: float64

The top 10 stocks for the portfolio, selected based on positive sentiment trends, show strong positive coverage, with Google (GOOGL), PayPal (PYPL), Adobe (ADBE), and Netflix (NFLX) leading at 96% positivity. These companies reflect strong sentiment, suggesting a good market perception and potential investor confidence. Microsoft (MSFT), Meta (META), and Amazon (AMZN) also rank high, with positivity scores above 80%. Such sentiment-driven analysis could signal stability and growth potential, making these stocks attractive candidates for the portfolio.
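One caveat worth noting: with roughly 25 articles per ticker, raw proportions tie easily, and a ticker with 5/5 positive articles would outrank one with 48/50. A Laplace-smoothed variant (an alternative to the selection rule used above, not part of the assignment) shrinks small samples toward 0.5:

```python
import pandas as pd

# Hypothetical per-ticker counts: positive articles vs. total articles
counts = pd.DataFrame({"positive": [24, 48, 5], "total": [25, 50, 5]},
                      index=["AAA", "BBB", "CCC"])

# Raw proportions: AAA and BBB tie at 0.96, tiny-sample CCC hits 1.0
counts["raw"] = counts["positive"] / counts["total"]

# Add-one smoothing favors positivity backed by more articles
counts["smoothed"] = (counts["positive"] + 1) / (counts["total"] + 2)
print(counts.sort_values("smoothed", ascending=False))
```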

Problem 4:¶

Upload the ratings.csv data set with movieId, userId, and the rating as columns and do the following.

  • Filter out all movies that have received fewer than 100 ratings. Then, create a utility matrix using the filtered data.
  • Convert the utility matrix to a DataFrame and calculate the number of missing ratings. Also, compute the percentage of missing values in this sparse matrix.
  • Build an SVD-based collaborative filtering recommender system to predict movie ratings. Calculate the RMSE for the model. Additionally, find a userId who has rated a movie and use the model to generate the top 5 movie recommendations for that user.
In [40]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 20)
In [43]:
# Load the config file
with open('config.json', 'r') as f:
    config = json.load(f)
    
data_loc = config["data_loc"]
file_name = "ratings.csv"
In [44]:
ratings_df = pd.read_csv(data_loc + file_name)
rows, columns = ratings_df.shape
print(f"The dataset contains {rows:,} rows and {columns} columns")
ratings_df.head(5)
The dataset contains 100,836 rows and 3 columns
Out[44]:
userId movieId rating
0 1 1 4.0
1 1 3 4.0
2 1 6 4.0
3 1 47 5.0
4 1 50 5.0
  • Filter out all movies that have received fewer than 100 ratings. Then, create a utility matrix using the filtered data.
In [58]:
# Identify popular movies: those with more than 100 ratings
pop_movies = ratings_df['movieId'].value_counts()
pop_movies = pop_movies[pop_movies > 100].index

# Keep only the ratings for those movies
ratings_filtered = ratings_df[ratings_df['movieId'].isin(pop_movies)]
ratings_filtered
Out[58]:
userId movieId rating
0 1 1 4.0
2 1 6 4.0
3 1 47 5.0
4 1 50 5.0
7 1 110 4.0
... ... ... ...
100217 610 48516 5.0
100310 610 58559 4.5
100326 610 60069 4.5
100380 610 68954 3.5
100452 610 79132 4.0

19788 rows × 3 columns

In [69]:
# Create a utility matrix with users as rows and movies as columns
utility_matrix = ratings_filtered.pivot(index='userId', columns='movieId', values='rating')
utility_matrix[0:5]
Out[69]:
movieId 1 2 6 10 32 34 39 47 50 110 ... 7153 7361 7438 8961 33794 48516 58559 60069 68954 79132
userId
1 4.0 NaN 4.0 NaN NaN NaN NaN 5.0 5.0 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN 4.0 4.5 NaN NaN 4.0
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN 2.0 NaN NaN 2.0 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 4.0 NaN NaN NaN NaN 4.0 3.0 NaN 4.0 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 134 columns
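`pivot` reshapes the long `(userId, movieId, rating)` table into a users × movies grid, leaving `NaN` wherever a user never rated a movie; a minimal sketch:

```python
import pandas as pd

# Three ratings in long format
ratings = pd.DataFrame({"userId": [1, 1, 2],
                        "movieId": [10, 20, 10],
                        "rating": [4.0, 5.0, 3.0]})

# Wide format: one row per user, one column per movie
utility = ratings.pivot(index="userId", columns="movieId", values="rating")
print(utility)  # user 2 never rated movie 20, so that cell is NaN
```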

  • Convert the utility matrix to a DataFrame and calculate the number of missing ratings. Also, compute the percentage of missing values in this sparse matrix.
In [77]:
# pivot already returns a DataFrame; the explicit conversion is kept to match the task wording
utility_matrix_df = pd.DataFrame(utility_matrix)
# Count every NaN cell (a movie the user never rated)
num_missing_ratings = utility_matrix_df.isna().sum().sum()
num_missing_ratings
Out[77]:
60210
In [78]:
# Calculate the total number of values in the matrix
total_values = utility_matrix.size
total_values
Out[78]:
79998
In [82]:
# Calculate the percentage of missing values
percent_missing = round((num_missing_ratings / total_values) * 100, 2)

# Display the results
print("Number of missing ratings:", num_missing_ratings)
print(f"Percentage of missing values: {percent_missing}%")
Number of missing ratings: 60210
Percentage of missing values: 75.26%

Utility Matrix Analysis¶

The utility matrix shows significant sparsity in the dataset, with 60,210 missing ratings, representing 75.26% of the data.

This high percentage of missing values is typical in real-world recommendation systems, where users interact with only a small fraction of available items. Such sparsity is a challenge but also highlights the value of collaborative filtering methods like SVD, which can predict missing ratings by uncovering latent patterns in user behavior and item characteristics.

The utility matrix’s structure allows us to systematically handle and analyze these missing values, making it a crucial tool for effective recommendations despite incomplete data.
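The "latent patterns" idea can be made concrete with a rank-k truncated SVD on a small dense matrix (note: `surprise`'s SVD fits latent factors by gradient descent on the observed entries only; this numpy sketch just shows the underlying linear-algebra intuition):

```python
import numpy as np

# Toy ratings matrix with two apparent "taste groups"
R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 1.0],
              [1.0, 1.0, 5.0]])

# Full SVD, then keep only the top-k singular triplets
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The rank-2 reconstruction approximates R; the discarded singular
# value s[2] is exactly the Frobenius norm of the error (Eckart-Young)
print(np.round(R_hat, 2))
```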

  • Build an SVD-based collaborative filtering recommender system to predict movie ratings. Calculate the RMSE for the model. Additionally, find a userId who has rated a movie and use the model to generate the top 5 movie recommendations for that user.
In [86]:
from surprise import Dataset, Reader
from surprise import SVD
from surprise.model_selection import train_test_split, cross_validate
from surprise import accuracy
from surprise.accuracy import rmse

Model Training and Evaluation¶

In [92]:
# Define a Reader with the appropriate rating scale
reader = Reader(rating_scale=(ratings_df['rating'].min(), ratings_df['rating'].max()))
data = Dataset.load_from_df(ratings_df[['userId', 'movieId', 'rating']], reader)

# Train-Test Split
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

# Initialize and Train the SVD Model
svd = SVD(n_epochs=20, lr_all=0.005, reg_all=0.2)
svd.fit(trainset)

# Evaluate the model (verbose=False suppresses rmse()'s own print, avoiding a duplicated line)
predictions = svd.test(testset)
rmse_score = rmse(predictions, verbose=False)
print("RMSE:", rmse_score)
RMSE: 0.8805328477373325

Cross-Validation¶

In [102]:
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8689  0.8703  0.8709  0.8856  0.8747  0.8741  0.0061  
MAE (testset)     0.6707  0.6711  0.6713  0.6794  0.6742  0.6733  0.0033  
Fit time          0.91    1.02    0.87    1.07    0.88    0.95    0.08    
Test time         0.31    0.20    2.42    0.10    0.10    0.62    0.90    
Out[102]:
{'test_rmse': array([0.86887032, 0.870294  , 0.87085634, 0.88561234, 0.87468257]),
 'test_mae': array([0.67072436, 0.67109641, 0.67125213, 0.67938632, 0.67419942]),
 'fit_time': (0.9120039939880371,
  1.0182032585144043,
  0.8686258792877197,
  1.0672190189361572,
  0.8846390247344971),
 'test_time': (0.306441068649292,
  0.20035314559936523,
  2.415268898010254,
  0.0992732048034668,
  0.09607696533203125)}

Cross-Validation Analysis¶

The evaluation of the SVD algorithm across 5 folds shows consistent performance, with an average RMSE of 0.8741 and an MAE of 0.6733.

These metrics indicate that the model’s predictions are reasonably accurate, though some variability exists (as seen in the standard deviations). RMSE, being slightly higher, suggests the presence of occasional larger errors, whereas the lower MAE indicates generally stable performance with smaller average errors.

These results suggest that SVD provides a robust foundation for accurate recommendations, though further tuning may reduce occasional outliers in predictions.
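For reference, the two metrics `cross_validate` reports reduce to simple formulas over the prediction errors; a numpy check on toy ratings (the squaring inside RMSE is what makes it penalize large errors more, so RMSE ≥ MAE always):

```python
import numpy as np

# Toy actual vs. predicted ratings
actual = np.array([4.0, 3.5, 5.0, 2.0])
predicted = np.array([3.8, 3.0, 4.5, 2.5])

errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))  # root mean squared error
mae = np.mean(np.abs(errors))         # mean absolute error

print(f"RMSE = {rmse:.4f}, MAE = {mae:.4f}")
```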

User Recommendations¶

In [100]:
# Select a user who has rated movies
user_id = ratings_df['userId'].sample(1).iloc[0]  # Randomly selecting a userId for demonstration

# Get all movies and filter for those the user has not rated
all_movie_ids = ratings_df['movieId'].unique()
rated_movie_ids = ratings_df[ratings_df['userId'] == user_id]['movieId'].values
unrated_movie_ids = [movie_id for movie_id in all_movie_ids if movie_id not in rated_movie_ids]

# Predict ratings for movies the user hasn't rated
predictions = [svd.predict(user_id, movie_id) for movie_id in unrated_movie_ids]

# Sort predictions by estimated rating in descending order and select top 5
top_5_recommendations = sorted(predictions, key=lambda x: x.est, reverse=True)[:5]


print(f"Top 5 movie recommendations for user {user_id}:")
for pred in top_5_recommendations:
    print(f"Movie ID: {pred.iid}, Predicted Rating: {pred.est:.2f}")
Top 5 movie recommendations for user 414:
Movie ID: 898, Predicted Rating: 4.06
Movie ID: 1283, Predicted Rating: 4.04
Movie ID: 3435, Predicted Rating: 4.04
Movie ID: 2160, Predicted Rating: 4.02
Movie ID: 306, Predicted Rating: 4.00
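With thousands of unrated candidates, only the top-N predictions matter, so the full `sorted` can be replaced by `heapq.nlargest`; a sketch on hypothetical `(movie_id, predicted_rating)` pairs (the IDs and scores are illustrative, not model output):

```python
import heapq

# Hypothetical (movie_id, predicted_rating) pairs, standing in for
# the Prediction objects returned by svd.predict
predicted = [(898, 4.06), (306, 4.00), (1283, 4.04),
             (2160, 4.02), (3435, 4.04), (111, 3.10)]

# nlargest runs in O(n log k) instead of the O(n log n) of a full sort
top_5 = heapq.nlargest(5, predicted, key=lambda p: p[1])
for movie_id, est in top_5:
    print(f"Movie ID: {movie_id}, Predicted Rating: {est:.2f}")
```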

AI Tools Reflection¶

For this homework, I primarily relied on ChatGPT as my AI tool.

It helped me break down complex concepts into smaller, manageable parts, which I could then piece together to improve my understanding. Its explanations are detailed and easy to follow, and this approach significantly enhanced my learning experience.

Additionally, ChatGPT improved my writing, helping me articulate my ideas more clearly and concisely. In one instance, I spent about 25 minutes trying to formulate an explanation; after refining it with ChatGPT, the core meaning of my original response remained intact, but the clarity and brevity improved considerably. Each time I use it, I also ask for tips to improve my writing.

Another area where ChatGPT helped was data visualization. Previously, I would spend considerable time searching platforms like StackOverflow for the right plotting code. With ChatGPT, I was able to get the necessary code quickly and visualize the data more efficiently.